Supervised Learning
Question: Build a decision tree classifier for the type of each storm in nasaweather::storms based on its wind speed and pressure. Report its accuracy.
We will create a decision tree classifier to fit the model. Let’s start by inspecting the dataframe and plot a scatterplot of each of the storm types.
head(nasaweather::storms)## # A tibble: 6 × 11
## name year month day hour lat long pressure wind type seasday
## <chr> <int> <int> <int> <int> <dbl> <dbl> <int> <int> <chr> <int>
## 1 Allison 1995 6 3 0 17.4 -84.3 1005 30 Tropical D… 3
## 2 Allison 1995 6 3 6 18.3 -84.9 1004 30 Tropical D… 3
## 3 Allison 1995 6 3 12 19.3 -85.7 1003 35 Tropical S… 3
## 4 Allison 1995 6 3 18 20.6 -85.8 1001 40 Tropical S… 3
## 5 Allison 1995 6 4 0 22 -86 997 50 Tropical S… 4
## 6 Allison 1995 6 4 6 23.3 -86.3 995 60 Tropical S… 4
storms <- nasaweather::stormsggplot(data = nasaweather::storms, aes(x = pressure, y = wind)) +
geom_point(data = nasaweather::storms %>% select(-type), alpha = .25, color = "grey") +
geom_point(alpha = .25, position = position_jitter(.1), color = "green") +
facet_wrap(vars(type))Since the data is relatively well clustered, we can fit it into a decision tree to help us classify them. We will use wind and pressure in our model.
#Create model based on wind and pressure
model1 <- fit(decision_tree(mode = "regression"),
type ~ wind + pressure,
data = storms)
#fit model into a decision tree
model1 %>%
extract_fit_engine() %>%
rpart.plot::rpart.plot(roundint = FALSE, digits = 2, type = 5)Clustering
Question Consider the 4,000 biggest cities in the world, given by:
big_cities <- mdsr::world_cities %>%
arrange(desc(population)) %>%
slice_head(n = 4000)
head(big_cities)## # A tibble: 6 × 9
## geoname_id name latitude longitude country country_region population timezone
## <dbl> <chr> <dbl> <dbl> <chr> <chr> <dbl> <chr>
## 1 1796236 Shan… 31.2 121. CN 23 22315474 Asia/Sh…
## 2 745044 Ista… 41.0 28.9 TR 34 14804116 Europe/…
## 3 3435910 Buen… -34.6 -58.4 AR 07 13076300 America…
## 4 1275339 Mumb… 19.1 72.9 IN 16 12691836 Asia/Ko…
## 5 3530597 Mexi… 19.4 -99.1 MX 09 12294193 America…
## 6 1816670 Beij… 39.9 116. CN 22 11716620 Asia/Sh…
## # … with 1 more variable: modification_date <date>
Construct a k-means clustering of the latitude and longitude of these cities. Describe (qualitatively) the results of clustering with k=2.
Let’s start by splitting our data into our training and testing data with a 80-20 split.
set.seed(12345)
big_cities_split <- initial_split(big_cities, prop = 4/5)
big_cities_train <- training(big_cities_split)
big_cities_test <- testing(big_cities_split)
head(big_cities_train)## # A tibble: 6 × 9
## geoname_id name latitude longitude country country_region population timezone
## <dbl> <chr> <dbl> <dbl> <chr> <chr> <dbl> <chr>
## 1 2650839 Dudl… 52.5 -2.08 GB ENG 199059 Europe/…
## 2 1805753 Jinan 36.7 117. CN 25 4335989 Asia/Sh…
## 3 2034714 Sipi… 43.2 124. CN 05 555609 Asia/Sh…
## 4 515003 Oren… 51.8 55.1 RU 55 550204 Asia/Ye…
## 5 963241 Rand… -26.1 28.0 ZA 06 337053 Africa/…
## 6 1859675 Kawa… 34.8 135. JP 13 160520 Asia/To…
## # … with 1 more variable: modification_date <date>
Next, we need to isolate our latitude and longitude columns and fix their scale between 0 and 1. After that, we will run our kmeans model with 2 centers.
cluster <- big_cities_train %>%
select(latitude, longitude) %>%
mutate(latitude = rescale(latitude, to = c(0,1))) %>%
mutate(longitude = rescale(longitude, to = c(0,1)))
head(cluster)## # A tibble: 6 × 2
## latitude longitude
## <dbl> <dbl>
## 1 0.865 0.468
## 2 0.735 0.825
## 3 0.789 0.847
## 4 0.859 0.639
## 5 0.222 0.558
## 6 0.720 0.880
set.seed(12345)
clustering_results <- cluster %>%
kmeans(nstart = 10, centers = 2)
big_cities_clusters <- big_cities_train %>%
mutate(cluster = as.factor(clustering_results$cluster))Looking at our kmeans clustering results:
glance(clustering_results)## # A tibble: 1 × 4
## totss tot.withinss betweenss iter
## <dbl> <dbl> <dbl> <int>
## 1 267. 142. 126. 1
tidy(clustering_results)## # A tibble: 2 × 5
## latitude longitude size withinss cluster
## <dbl> <dbl> <int> <dbl> <fct>
## 1 0.505 0.308 1018 60.6 1
## 2 0.699 0.686 2182 81.1 2
head(big_cities_clusters)## # A tibble: 6 × 10
## geoname_id name latitude longitude country country_region population timezone
## <dbl> <chr> <dbl> <dbl> <chr> <chr> <dbl> <chr>
## 1 2650839 Dudl… 52.5 -2.08 GB ENG 199059 Europe/…
## 2 1805753 Jinan 36.7 117. CN 25 4335989 Asia/Sh…
## 3 2034714 Sipi… 43.2 124. CN 05 555609 Asia/Sh…
## 4 515003 Oren… 51.8 55.1 RU 55 550204 Asia/Ye…
## 5 963241 Rand… -26.1 28.0 ZA 06 337053 Africa/…
## 6 1859675 Kawa… 34.8 135. JP 13 160520 Asia/To…
## # … with 2 more variables: modification_date <date>, cluster <fct>
Now that our data has been properly formatted to visualize, let’s start by plotting a scatterplot.
#plot scatterplot
latlong_plot <-
ggplot(big_cities_clusters, aes(x = longitude, y=latitude, color = cluster))+
geom_point(alpha = 0.5)+
coord_fixed(ratio = 1)
latlong_plot But how do we know where on the globe these cities are? Let’s overlay this scatterplot onto a world map using plotly.
#overlay scatterplots over a world map using plotly
map_plot <- plot_geo(big_cities_clusters, lat = ~latitude, lon = ~longitude, color = ~cluster) %>%
add_trace(marker = list(opacity = 0.7))
map_plotAs we can see from our map, using a k-means clustering with k=2, the plots are broken roughly into 2 clusters, one containing North America, South America, and Africa, and the other cluster containing Europe, Asia, and Australia. There is some overlap within Africa, divided into the North and South, with the North clustered with Europe, and the South clustered with the Americas.
Databases
Which baseball players have hit 500 home runs (HR) OR 3000 hits (H) but have not (yet?) been inducted into the Baseball Hall of Fame?
First we will load the Lahman package
library(Lahman)Next, we will inspect the data. By looking at our dataset structures and conducting some anti-joins between tables.
head(HallOfFame)## playerID yearID votedBy ballots needed votes inducted category needed_note
## 1 cobbty01 1936 BBWAA 226 170 222 Y Player <NA>
## 2 ruthba01 1936 BBWAA 226 170 215 Y Player <NA>
## 3 wagneho01 1936 BBWAA 226 170 215 Y Player <NA>
## 4 mathech01 1936 BBWAA 226 170 205 Y Player <NA>
## 5 johnswa01 1936 BBWAA 226 170 189 Y Player <NA>
## 6 lajoina01 1936 BBWAA 226 170 146 N Player <NA>
head(Batting)## playerID yearID stint teamID lgID G AB R H X2B X3B HR RBI SB CS BB SO
## 1 abercda01 1871 1 TRO NA 1 4 0 0 0 0 0 0 0 0 0 0
## 2 addybo01 1871 1 RC1 NA 25 118 30 32 6 0 0 13 8 1 4 0
## 3 allisar01 1871 1 CL1 NA 29 137 28 40 4 5 0 19 3 1 2 5
## 4 allisdo01 1871 1 WS3 NA 27 133 28 44 10 2 2 27 1 1 0 2
## 5 ansonca01 1871 1 RC1 NA 25 120 29 39 11 3 0 16 6 2 2 1
## 6 armstbo01 1871 1 FW1 NA 12 49 9 11 2 1 0 5 0 1 0 1
## IBB HBP SH SF GIDP
## 1 NA NA NA NA 0
## 2 NA NA NA NA 0
## 3 NA NA NA NA 1
## 4 NA NA NA NA 0
## 5 NA NA NA NA 0
## 6 NA NA NA NA 0
head(People)## playerID birthYear birthMonth birthDay birthCountry birthState birthCity
## 1 aardsda01 1981 12 27 USA CO Denver
## 2 aaronha01 1934 2 5 USA AL Mobile
## 3 aaronto01 1939 8 5 USA AL Mobile
## 4 aasedo01 1954 9 8 USA CA Orange
## 5 abadan01 1972 8 25 USA FL Palm Beach
## 6 abadfe01 1985 12 17 D.R. La Romana La Romana
## deathYear deathMonth deathDay deathCountry deathState deathCity nameFirst
## 1 NA NA NA <NA> <NA> <NA> David
## 2 2021 1 22 USA GA Atlanta Hank
## 3 1984 8 16 USA GA Atlanta Tommie
## 4 NA NA NA <NA> <NA> <NA> Don
## 5 NA NA NA <NA> <NA> <NA> Andy
## 6 NA NA NA <NA> <NA> <NA> Fernando
## nameLast nameGiven weight height bats throws debut finalGame
## 1 Aardsma David Allan 215 75 R R 2004-04-06 2015-08-23
## 2 Aaron Henry Louis 180 72 R R 1954-04-13 1976-10-03
## 3 Aaron Tommie Lee 190 75 R R 1962-04-10 1971-09-26
## 4 Aase Donald William 190 75 R R 1977-07-26 1990-10-03
## 5 Abad Fausto Andres 184 73 L L 2001-09-10 2006-04-13
## 6 Abad Fernando Antonio 235 74 L L 2010-07-28 2019-09-28
## retroID bbrefID deathDate birthDate
## 1 aardd001 aardsda01 <NA> 1981-12-27
## 2 aaroh101 aaronha01 2021-01-22 1934-02-05
## 3 aarot101 aaronto01 1984-08-16 1939-08-05
## 4 aased001 aasedo01 <NA> 1954-09-08
## 5 abada001 abadan01 <NA> 1972-08-25
## 6 abadf001 abadfe01 <NA> 1985-12-17
Batting %>%
anti_join(x=Batting, y=People, by = "playerID") %>% head()## [1] playerID yearID stint teamID lgID G AB R
## [9] H X2B X3B HR RBI SB CS BB
## [17] SO IBB HBP SH SF GIDP
## <0 rows> (or 0-length row.names)
Batting %>%
anti_join(HallOfFame, by = "playerID") %>% head()## playerID yearID stint teamID lgID G AB R H X2B X3B HR RBI SB CS BB SO
## 1 abercda01 1871 1 TRO NA 1 4 0 0 0 0 0 0 0 0 0 0
## 2 addybo01 1871 1 RC1 NA 25 118 30 32 6 0 0 13 8 1 4 0
## 3 allisar01 1871 1 CL1 NA 29 137 28 40 4 5 0 19 3 1 2 5
## 4 armstbo01 1871 1 FW1 NA 12 49 9 11 2 1 0 5 0 1 0 1
## 5 barkeal01 1871 1 RC1 NA 1 4 0 1 0 0 0 2 0 0 1 0
## 6 barrebi01 1871 1 FW1 NA 1 5 1 1 1 0 0 1 0 0 0 0
## IBB HBP SH SF GIDP
## 1 NA NA NA NA 0
## 2 NA NA NA NA 0
## 3 NA NA NA NA 1
## 4 NA NA NA NA 0
## 5 NA NA NA NA 0
## 6 NA NA NA NA 0
HallOfFame %>%
anti_join(Batting, by = "playerID") %>% head()## playerID yearID votedBy ballots needed votes inducted category
## 1 bulkemo99 1937 Centennial NA NA NA Y Pioneer/Executive
## 2 johnsba99 1937 Centennial NA NA NA Y Pioneer/Executive
## 3 cartwal99 1938 Centennial NA NA NA Y Pioneer/Executive
## 4 chadwhe99 1938 Centennial NA NA NA Y Pioneer/Executive
## 5 mccarjo99 1939 BBWAA 274 206 3 N Manager
## 6 landike99 1944 Old Timers NA NA NA Y Pioneer/Executive
## needed_note
## 1 <NA>
## 2 <NA>
## 3 <NA>
## 4 <NA>
## 5 <NA>
## 6 <NA>
These 3 tables are connected with the same playerID key. There are playerIDs in Batting & People that are not in HallofFame. We will assume that these individuals they have all not been inducted into the HallofFame. Thus, we will join the 3 tables together and make a list of players who have hit 500 home runs or 3000, but have not been inducted into the Hall of Fame.
Batting %>%
full_join(People, by = "playerID") %>%
full_join(HallOfFame, by = "playerID") %>%
filter(category == "Player") %>%
filter(inducted != "Y") %>%
filter(HR > 500 | H > 3000) %>%
select(nameFirst, nameLast, HR, H, inducted)## [1] nameFirst nameLast HR H inducted
## <0 rows> (or 0-length row.names)
There are no players who hit > 500 home runs or >3000 hits that have not been inducted into the HallOfFame database.
Text Data
Question: How many speaking lines are there in Macbeth? Speaking lines are identified by a line that starts with two spaces, then a string of capital letters (possibly including spaces) indicating the character’s name, followed by a period.
I will start by loading in the Macbeth file and parse the lines into separate rows.
library(tidyverse)
macbeth_url <- "http://www.gutenberg.org/cache/epub/1129/pg1129.txt"
data(Macbeth_raw, package = "mdsr")
#Parse lines into separate rows
macbeth <- Macbeth_raw %>%
stringi::stri_split_lines() %>%
pluck(1)
length(macbeth)## [1] 3194
head(macbeth)## [1] "This Etext file is presented by Project Gutenberg, in"
## [2] "cooperation with World Library, Inc., from their Library of the"
## [3] "Future and Shakespeare CDROMS. Project Gutenberg often releases"
## [4] "Etexts that are NOT placed in the Public Domain!!"
## [5] ""
## [6] "*This Etext has certain copyright implications you should read!*"
Next, I will create a regex pattern and detect the regex pattern using str_subset & str_detect. I will show both the patterns detected and the full lines that contain the detected patterns.
pattern <- "^ [A-Z ]+[.]{1} "
#show the regex pattern detected to check
macbeth %>% str_subset(pattern) %>% str_extract(pattern) %>% head(10)## [1] " FIRST WITCH. " " SECOND WITCH. " " THIRD WITCH. " " FIRST WITCH. "
## [5] " SECOND WITCH. " " THIRD WITCH. " " FIRST WITCH. " " ALL. "
## [9] " DUNCAN. " " MALCOLM. "
#show the lines detected to check
macbeth %>% str_subset(pattern) %>% head(10)## [1] " FIRST WITCH. When shall we three meet again?"
## [2] " SECOND WITCH. When the hurlyburly's done,"
## [3] " THIRD WITCH. That will be ere the set of sun."
## [4] " FIRST WITCH. Where the place?"
## [5] " SECOND WITCH. Upon the heath."
## [6] " THIRD WITCH. There to meet with Macbeth."
## [7] " FIRST WITCH. I come, Graymalkin."
## [8] " ALL. Paddock calls. Anon!"
## [9] " DUNCAN. What bloody man is that? He can report,"
## [10] " MALCOLM. This is the sergeant"
Now that I have verified that our regex pattern is working correctly, I will compute the total number of lines containing the pattern.
macbeth %>% str_detect(pattern) %>% sum()## [1] 644
There are 644 speaking lines in Macbeth.
Note: I included a forced space at the end of the regex pattern because there are two lines that fit the regex pattern but were not speaking lines. In the following code chunk, I displayed the two faulty lines using the regex pattern without a forced space at the end.
pattern2 <- "^ [A-Z ]+[.]{1}"
macbeth %>% str_subset(pattern2) %>% head(2)## [1] " ASCII." " P.O. Box 2782"
Question: Find the 10 most popular boys’ names in 2017 that end in a vowel. Use the babynames::babynames table. (Hint: str_detect.)
First I will load the babynames package.
library(babynames)
head(babynames::babynames)## # A tibble: 6 × 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 F Mary 7065 0.0724
## 2 1880 F Anna 2604 0.0267
## 3 1880 F Emma 2003 0.0205
## 4 1880 F Elizabeth 1939 0.0199
## 5 1880 F Minnie 1746 0.0179
## 6 1880 F Margaret 1578 0.0162
We want only the names of boys in 2017, so we will filter for this.
vowelboys <- babynames::babynames %>%
filter(year == 2017 & sex == "M")
head(vowelboys)## # A tibble: 6 × 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 2017 M Liam 18728 0.00954
## 2 2017 M Noah 18326 0.00933
## 3 2017 M William 14904 0.00759
## 4 2017 M James 14232 0.00725
## 5 2017 M Logan 13974 0.00712
## 6 2017 M Benjamin 13733 0.00699
We will want a single vector for this.
vowelboys2 <- vowelboys$nameNow, we will use a regex pattern to detect all the names that end with a vowel. We will use str_subset to first check if it is working correctly.
regex <- "[a,e,i,o,u]$"
vowelboys2 %>% str_subset(regex) %>% head()## [1] "Luke" "Levi" "Joshua" "Mateo" "Eli" "Leo"
Now that our regex pattern is working correctly, we will use str_detect to add a column to our dataset to indicate if the name ends with a vowel. Then, we will filter for only “TRUE” values, arrange the dataset by descending occurences (n column), and take the top 10 rows. This will give us the top 10 most popular boys names in 2017 that end in a vowel.
top10vowelboys <-vowelboys %>%
mutate(endVowel = name %>% str_detect(regex)) %>%
filter(endVowel == "TRUE") %>%
arrange(desc(n)) %>%
head(10)The top 10 most popular boys names ending with a vowel in 2017 are Luke, Levi, Joshua, Mateo, Eli, Leo, Theodore, Ezra, Jose, Jace.